Estimation of population parameters using sample extremes from nonconstant sample sizes

We examine the accuracy and precision of parameter estimates for both the exponential and normal distributions when using only a collection of sample extremes. That is, we consider a collection of random variables, where each of the random variables is either the minimum or maximum of a sample of nj independent, identically distributed random variables drawn from a normal or exponential distribution with unknown parameters. Previous work derived estimators for the population parameters assuming the nj sample sizes are constant. Since sample sizes are often not constant in applications, we derive new unbiased estimators that take into account the varying sample sizes. We also perform simulations to assess how the previously derived estimators perform when the constant sample size is simply replaced with the average sample size. We explore how varying the mean, standard deviation, and probability distribution of the sample sizes affects the estimation error. Overall, our results demonstrate that using the average sample size in place of the constant sample size still results in reliable estimates for the population parameters, especially when the average sample size is large. Our estimation framework is applied to a biological example involving plant pollination.


Introduction
We consider a data set where each observation is the minimum or maximum of a sample of independent, identically distributed (iid) random variables drawn from a population with unknown parameters. That is, our data consists of W j ¼ minfX ij g n j i¼1 or Y j ¼ maxfX ij g n j i¼1 for j = 1, . . ., m. Here m represents the number of known sample extremes, while n j represents the sample size of the jth sample over which the minimum or maximum is computed. While the X ij are iid draws from a population with unknown parameters, the X ij values themselves are not directly observable. Rather, we consider the case where it is only possible to measure the minimum or maximum of each sample. We seek to use the minimum and maximum values in order to estimate the unknown population parameters.
In [1], estimators for the exponential and normal distributions were derived assuming the n j sample sizes are constant. Those two population distributions were considered since they arise frequently in applications and since they allow for explicit derivations. However, constant sample sizes are unrealistic to occur in applications. For example, [2,3] examined the fertilization process in the flowering plant Arabidopsis thaliana. Their experiments begin by placing pollen on the stigma of the plant; they then wait a certain time period, kill the plant, and image it in order to view the fertilization progress. In the images, the researchers are able to view the pollen tubes, which grow down from the stigma towards the plant ovules. When a pollen tube reaches an ovule, it can fertilize it, which will then develop into a seed. The average fertilization speed is of biological interest and can be calculated by dividing the average pollen tube length by the amount of time lapsed since the pollen were placed on the stigma. However, due to the high density of pollen tubes, it is only feasible for the researchers to measure the longest pollen tube within each plant, and yet they wish to use these longest measurements to estimate the overall average length. Hence, their data fits our framework of estimating a population parameter with sample extremes. The n j sample sizes in this application represent the number of pollen tubes within the jth plant. It is not practical for researchers to place the same exact number of pollen grains on each plant, and hence the n j values will naturally vary. While [1] estimated the average pollen tube length by assuming a constant sample size for the number of pollen tubes within each plant, we seek to improve the estimate by deriving a new estimator for the population mean that allows for the samples sizes themselves to be random variables.
In Section 2, which focuses on the case where the population is exponentially distributed, we derive new estimators for the population mean under increasingly more realistic assumptions for the n j sample sizes. We also derive the variances of our new estimators, and consider cases where the n j sample sizes follow either a uniform or Poisson distribution. We compare the accuracy and precision of our new estimators to the original estimator computed in [1] under the constant sample size assumption. When the population is normally distributed, it is not feasible to analytically derive unbiased estimators for the population parameters using sample extremes with nonconstant sample sizes. Hence, in Section 3, we focus solely on analyzing the performance of the original estimators from [1] for the normal distribution when the constant sample size assumption is violated and the average sample size is instead used. In Section 4, we compare our methodology to that of maximum likelihood estimation. Lastly, in Section 5, we apply our results to the plant pollination example.

Estimator for exponential distribution
In this section we assume that X ij � iid ExpðbÞ where β is the unknown population mean. We set for j = 1, . . ., m.

Sample sizes n j are all equal to a known value n
In [1], it was shown that in the case of constant sample size n, the estimator is an unbiased estimator for β with where While the prior work did not consider the case of a sample of sample minima, it follows directly from properties of the exponential distribution [4] and the methods in [1] that the estimatorb is also an unbiased estimator for β with While there exists an unbiased estimator using either sample maxima or sample minima, we observe from Eqs (2) and (4) that the variance using sample maxima is strictly smaller with varðb 1 Þ ! 0 at rate proportional to 1 ðlog nÞ 2 as n ! 1, while varðb 1 Þ is constant with respect to n. Hence, more precise estimates for the exponential distribution can be obtained from using maximum values compared to using minimum values.

Sample sizes n j are unequal, but are all known values
As mentioned in Section 1, it is unrealistic for the sample sizes to be equal in applications. Hence, we begin to improve the estimation framework by allowing the sample sizes to vary, but with the restriction that the sample sizes are known values. It follows directly from properties of the exponential distribution [4] and the methods in [1] that the estimator with variance and the estimatorb with variance are unbiased for estimating the population mean β. The estimators introduced in Eqs (5) and (7) reduce to the previous estimators given in Eqs (1) and (3) when all the n j are equal. Hence, these estimators are simply generalizations that allow them to be utilized in a broader range of applications. Yet, in some applications, not only do the sample sizes vary, but the sample sizes are also unknown. For example, in the plant pollination application described in Section 1, hundreds of pollen grains are placed on each plant and it is not possible for the researchers to count the exact numbers. Hence, the sample sizes are unknown, but the researchers do have an idea of the probability distribution of the sample sizes. In the following subsection, we further improve the estimation framework by considering the case where the sample sizes are themselves random variables.

Sample sizes n j are random with known distribution
We now suppose that the n j sample sizes are iid random variables. We assume that the n j values are unknown, but come from some known probability distribution, such as the uniform or Poisson distribution. It immediately follows from the proof of Theorem 1 in [1] that By properties of conditional expectation and conditional variance [4], is unbiased for estimating β. Likewise, Hence, the estimatorb is also unbiased for estimating β.
We observe from comparing Eqs (10) and (12) with Eqs (2) and (4), respectively, that the variance of the estimators is higher when the sample sizes are random variables compared to when the sample sizes are constant. This relationship is unsurprising since it is intuitive that higher variation in the sample sizes will result in higher variation in the estimation of the population mean.
In order to actually calculate the estimatorsb 3 andb 3 , the values of E½H n j � and E 1 then Or if the n j follow a Poisson distribution (shifted to start at 1), i.e. n j À 1 � iid PoissonðlÞ with known λ, then E[n j ] = λ + 1,

Sample sizes n j are random with only known mean
If the full probability distribution of the n j is not known, but the average sample size, E[n j ], is known, it would be reasonable to approximate E½H n j � and E 1 n j h i with H E½n j � and 1 E½n j � , respectively.
Using these approximations yields the following estimators: Since i is a concave function of n j , E½H n j � < H E½n j � by Jensen's inequality [5]. Hence,b 4 is a biased estimator of β that tends to underestimate the true value of β. Conversely, E½n j � , which makesb 4 a biased estimator that tends to overestimate the true value of β.
Note thatb 4 andb 4 are equivalent to using the original estimatorsb 1 andb 1 and replacing the constant sample size n with E[n j ]. In the next subsection, we compare the performance of the various estimators.

Comparison of estimators
The estimatorsb 2 ,b 3 , andb 4 based upon a sample of sample maxima all reduce to the estimatorb 1 when the sample sizes n j are all equal to a constant value n. Likewise, the estimatorsb 2 , b 3 , andb 4 based upon a sample of sample minima all reduce to the estimatorb 1 in the case of known equal sample sizes. As mentioned previously, the estimatorsb 2 andb 2 allow for unequal sample sizes, but still require that the sample sizes are known values. On the other hand, the estimatorsb 3 ,b 3 ,b 4 , andb 4 allow for the sample sizes to be unknown iid random variables, which is the most realistic scenario to occur in applications. 4 all allow for random sample sizes,b 3 andb 3 require that the full probability distribution of the sample sizes is known, whileb 4 andb 4 only require the average sample size to be known. Thus, the estimatorsb 4 andb 4 , which are equivalent to the original estimatorsb 1 andb 1 when the constant sample size is simply replaced with the average sample size, have the advantage of applying in the broadest range of circumstances. Yet, they have the disadvantage of being the only biased estimators presented here.
Fig 1 illustrates the ratio betweenb 3 andb 4 , which is equal to the ratio between 1=E½H n j � and 1=H E½n j � , as well as the ratio betweenb 3 andb 4 , which is equal to the ratio between 1=E 1 n j h i and E[n j ], in the case where the sample sizes follow a uniform distribution. We observe in Fig   1 that the ratio when using sample maxima (b 3 =b 4 ) is always greater than one sinceb 4 tends to underestimate the true population mean. In contrast, the ratio when using sample minima (b 3 =b 4 ) is always less than one sinceb 4 tends to overestimate the true population mean. In either case, the ratio between the two estimators is further from one when the width of the uniform distribution is larger since a wider interval results in more variability in the sample sizes. However, the ratio of the two estimators converges to one as the average sample size E[n j ] increases. The convergence is faster when using sample maxima compared to when using sample minima. Similar patterns can be observed when the sample sizes follow other probability distributions, such as the Poisson distribution. Overall, Fig 1 illustrates that althoughb 4 and b 4 , which simply use the average sample size, are biased, they can still result in good estimates, especially when the average sample size is large.

Estimators for normal distribution
We now consider the case where X ij � iid Normal with unknown mean μ and unknown variance σ 2 . We again set W j ¼ minfX ij g n j i¼1 and Y j ¼ maxfX ij g n j i¼1 for j = 1, . . ., m.

Sample sizes n j are all equal to a known value n
The estimators given in [1] when using a sample of maximum values and assuming a constant sample size n arem Here � Y and S 2 Y denote the sample mean and sample variance, respectively, of the Y j 's, while k n denotes the mean and c n denotes the variance of the maximum of n iid Normal(0, 1) random variables. It was shown thatŝ 2 is an unbiased estimator for σ 2 , but that EðmÞ > m with EðmÞ ! m as m ! 1.
Although [1] did not consider the case of a sample of minimum values, we can define the analogous estimatorsm where now � W and S 2 W denote the sample mean and sample variance, respectively, of the W j 's. The constants k n and c n appear in both sets of estimators since the mean of the minimum of n iid Normal(0, 1) random variables is equal to the negative of the mean of the maximum of n iid Normal(0, 1) random variables, while the variance is the same in both cases. The values for k n and c n can be approximated either analytically or through simulations [6].
Due to the symmetry of the normal distribution, the estimators using minimum values will have equivalent performance to the estimators using maximum values. More specifically, it follows from symmetry thats 2 is also an unbiased estimator for σ 2 and that EðmÞ < m with EðmÞ ! m as m ! 1. This equivalent performance using either sample maxima or sample minima is in stark contrast to the case when the population is exponentially distributed. For an exponential population, it was shown in Section 2 that it is more advantageous to estimate the population parameters using a sample of maximum values due to the significantly smaller variance of the estimators in the maxima setting compared to the minima setting.

Sample sizes n j are random with known mean
Due to the complexity of the normal probability density function, it is not feasible to derive unbiased estimators for μ and σ 2 in the case where the sample sizes are unequal known values or are random variables with a known distribution. However, if the n j values are iid random variables with known average sample size, E[n j ], it is reasonable to replace the k n and c n values in Eqs (15) and (16) with k E½n j � and c E½n j � , respectively.
We evaluate the performance of the estimators when the constant sample size n is replaced with the average sample size E[n j ] through simulations. Although the true population mean μ and true population variance σ 2 are unknown, we set them equal to 0 and 1, respectively, for our simulations so that we can explicitly compare in our simulations how closem andŝ 2 are to the true values. True values other than 0 or 1 would shift and rescale the distribution, but would not affect the relative performance of the estimators. Our simulations focus onm and s 2 , which are based on sample maxima, but due to the symmetry of the normal distribution,m ands 2 , which are based on sample minima, would have analogous performance.

Comparison to maximum likelihood estimation
Suppose X ij are drawn iid from an arbitrary population with probability density function f(x| Θ) and cumulative distribution function F(x|Θ), where Θ is the vector of unknown population parameters. Let W j ¼ minfX ij g n j i¼1 represent the sample minima and Y j ¼ maxfX ij g n j i¼1 represent the sample maxima, for j = 1, . . ., m. When the sample sizes n j are known, then the likelihood function is equal to n j Fðy j jYÞ n j À 1 f ðy j jYÞ ð17Þ when using the sample maxima, and it is equal to n j ð1 À Fðw j jYÞÞ n j À 1 f ðw j jYÞ ð18Þ when using the sample minima. When the sample sizes n j are not known values, but rather are random variables drawn iid from a population with probability mass function p(n), then the likelihood function is equal to when using the sample maxima, and it is equal to when using the sample minima. Maximum likelihood estimation chooses to estimate Θ with the value that maximizes the likelihood function L(Θ). This value of Θ, called the maximum likelihood estimator or MLE, is often found by differentiating the logarithm of the likelihood function and setting the derivative equal to zero. However, in many cases the equation cannot be solved exactly, and numerical methods must be used in order to approximate the maximum likelihood estimator.
In [7], data from high-voltage power lines was collected to estimate the variance of the corona noise (often heard as a crackling or hissing sound in power lines). Due to limitations of the instruments, only the maximum value of the corona noise was recorded in each three second time lapse, but each maximum value was computed over a sample of size exactly 400. An explicit formula for the maximum likelihood estimator was derived in [7] for the framework of data consisting of sample maxima with equal sample sizes. However, their formula is dependent upon their specific assumption for the population distribution of the corona noise in power lines.
When the population distribution of the X ij is exponential, as considered in Section 2, then the maximum likelihood estimator can be computed explicitly when using sample minima with known sample sizes, and it is equivalent to our estimator given in Eq (3) when the sample sizes are all equal and is equivalent to our estimator given in Eq (7) when the sample sizes are unequal. When the population is exponentially distributed with known sample sizes, but the data consists of sample maxima, the maximum likelihood estimator cannot be solved for explicitly due to the complexity of the likelihood function. Likewise, when the population is normally distributed or when the sample sizes are unknown with some probability distribution, explicit formulas cannot be found for the maximum likelihood estimators. However, numerical methods can be used to approximate the maximum likelihood estimators in these cases. Table 1 compares the estimates for the mean β of the exponential distribution when using our formula described in Eq (10) versus when approximating the maximum likelihood estimator using the built-in mle function in Matlab for a custom likelihood function of the form given in Eq (19). Likewise, Table 2 compares the estimates for the mean μ and variance σ 2 of the normal distribution when using our formulas described in Eq (17) versus when using maximun likelihood estimation. In all cases, 100 simulations were run, each with m = 10 sample maxima and with the raw data drawn from either an exponential distribution with true β = 3 or a normal distribution with true μ = 0 and σ 2 = 1. The n j values were simulated from five possible probability distributions, all with E[n j ] set equal to 100. The tables display the mean and standard deviation of the parameter estimates across the 100 simulations for each case, as well as the mean square error (MSE). The mean square error is calculated as the bias (mean of the estimates minus the true parameter value) squared plus the variance of the estimates. Changing the values of β, μ, σ 2 , m, and E[n j ] shifts and rescales the results, but does not change the relative performance of the estimates from our formulas compared to maximum likelihood estimation.
The results displayed in Tables 1 and 2 indicate that the parameter estimates from our formulas tend to be centered closer to the true values on average, whereas maximum likelihood estimation consistently overestimates β and μ, while consistently underestimating σ 2 . When the population is exponentially distributed, our formulas produce similar variability in the parameter estimates and a substantially smaller mean square error compared to maximum likelihood estimation. When the population is normally distributed, maximum likelihood estimation produces smaller variability in the parameter estimates and a slightly smaller mean square error compared to our formulas. However, the fact that our formulas almost always result in less bias indicates that our approach is a valuable alternative to maximum likelihood estimation in both cases. Moreover, the key advantage of our methodology is that we have presented explicit formulas for estimators that can be computed quickly by simply plugging in sample data.

Biological application
We return now to the example described in Section 1 regarding estimating the mean pollen tube length based upon measurements of the longest pollen tube length in a sample of plants.
In the laboratory experiments performed by Swanson et al., either m = 8 or m = 9 individual plants were used for each time point (3, 6, 9, or 24 hours) for each accession (Columbia or Landsberg) of Arabidopsis thaliana. In [1], the mean pollen tube length was estimated assuming a constant n = 933 number of pollen tubes within each plant of the Columbia accession and a constant n = 727 number of pollen tubes within each plant of the Landsberg accession.
The estimation was performed assuming that the individual pollen tube lengths are exponentially distributed, which was shown to be a reasonable fit for the data. The number of pollen tubes is not in fact constant across all the plants, with the 933 and 727 values actually representing the average number of pollen tubes within each plant for the Columbia and Landsberg accessions, respectively. Data from [2] indicates that it is reasonable to assume that the number of pollen tubes within each plant follows a uniform distribution, with range (640, 1226) for the Columbia accession and range (342, 1112) for the Landsberg accession. Hence, we apply our estimation framework derived in Section 2.3 to estimate the overall mean pollen tube length taking into account the probability distribution of the sample sizes. The resulting estimates, along with the original estimates assuming a constant sample size, are listed in Table 3.
The original estimates assuming a constant sample size n are equivalent to the estimates that arise when the sample sizes are random and we simply use the average sample size E[n j ] in place of n. However, it was shown in Section 2.4 that simply replacing the constant sample size with the mean sample size leads to a biased estimator that tends to underestimate the true population mean. Thus, although the two sets of estimates are fairly similar, the new estimates, which take into account the distribution of the sample sizes, are more likely to be closer to the true mean pollen tube lengths.